Gradient descent#
Algorithm (Gradient descent)
To solve the optimization problem \(\mathcal L(\boldsymbol w) \to \min\limits_{\boldsymbol w}\)
do the following steps:
initialize \(\boldsymbol w\) with some random values (e.g., drawn from \(\mathcal N(0, 1)\))
choose tolerance \(\varepsilon > 0\) and learning rate \(\eta > 0\)
while \(\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon\) do the gradient step
\[ \boldsymbol w := \boldsymbol w - \eta\nabla\mathcal L(\boldsymbol w) \]
return \(\boldsymbol w\)
Fig. 1 Main idea of gradient descent algorithm#
The algorithm initializes parameters randomly, sets hyperparameters such as tolerance and learning rate, and iteratively updates the parameters using gradient descent until convergence, ultimately returning the optimized parameter vector. This general framework is widely used for solving optimization problems in machine learning and other fields.
Fig. 2 Iteration process of gradient descent#
Note
The learning rate determines the size of the steps taken during optimization and thus influences the convergence speed: a smaller learning rate may lead to more precise results but slower convergence.
Note
If the condition \(\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon\) holds for too long, the loop in step 3 is terminated after a maximum number of iterations max_iter.
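The algorithm, including the max_iter safeguard, can be sketched as follows. The loss \(\mathcal L(\boldsymbol w) = \Vert\boldsymbol w\Vert^2/2\) and the function name gradient are illustrative assumptions, not part of the text above:

```python
import numpy as np

def gradient(w):
    # Gradient of the illustrative loss L(w) = ||w||^2 / 2
    return w

rng = np.random.default_rng(0)
w = rng.normal(0, 1, size=3)            # random initialization from N(0, 1)
eps, eta, max_iter = 1e-6, 0.1, 10_000  # tolerance, learning rate, safeguard

i = 0
while np.linalg.norm(gradient(w)) > eps and i < max_iter:
    w = w - eta * gradient(w)           # gradient step
    i += 1
```

For this loss each step shrinks \(\boldsymbol w\) by the factor \(1 - \eta\), so the loop stops well before max_iter is reached.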
Example#
Let’s consider the simple quadratic function \(f(x) = x^2\). The purpose is to demonstrate a function where gradient descent can converge to the minimum in just a few steps.
Calculate the Gradient: \(\nabla f(x) = 2x\)
Initialize Parameters: Choose an initial value for x, e.g., \(x_{\text{old}} = 3.0\).
Set Learning Rate: Choose a small learning rate, e.g., \(\alpha = 0.1\).
Update rule: \(x_{\text{new}} = x_{\text{old}} - \alpha \cdot \nabla f(x_{\text{old}})\)
Iterative updates:
Iteration 1: \(\nabla f(x_{\text{old}}) = 2 \cdot 3.0 = 6.0\) \(x_{\text{new}} = x_{\text{old}} - 0.1 \cdot 6.0 = 3.0 - 0.6 = 2.4\) Result: \(x_{\text{new}} = 2.4, f(x_{\text{new}}) = (2.4)^2 = 5.76\)
Iteration 2: \(\nabla f(x_{\text{old}}) = 2 \cdot 2.4 = 4.8\) \(x_{\text{new}} = x_{\text{old}} - 0.1 \cdot 4.8 = 2.4 - 0.48 = 1.92\) Result: \(x_{\text{new}} = 1.92, f(x_{\text{new}}) = (1.92)^2 = 3.6864\)
\[ \ldots \]
Final result, iteration n: \(\nabla f(x_{n-1}) = 2 \cdot x_{n-1}\) \(x_{n} = x_{n-1} - 0.1 \cdot \nabla f(x_{n-1})\) Result: \(f(x_{n}) = (x_{n})^2\)
The process should show that the value of \(x\) approaches the minimum \((x=0)\) relatively quickly due to the simplicity of the quadratic function.
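The iterations above can be reproduced with a short loop; since each update gives \(x_{\text{new}} = (1 - 0.1 \cdot 2)\, x_{\text{old}} = 0.8\, x_{\text{old}}\), the iterates decay geometrically toward the minimum at \(x = 0\):

```python
def grad_f(x):
    # Gradient of f(x) = x^2
    return 2 * x

x = 3.0          # initial value x_old = 3.0
alpha = 0.1      # learning rate
trajectory = [x]
for _ in range(10):
    x = x - alpha * grad_f(x)   # each step multiplies x by 0.8
    trajectory.append(x)
```

The first two iterates match the hand computation above (2.4, then 1.92), and after ten steps \(x = 3 \cdot 0.8^{10} \approx 0.32\).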
Nevertheless, gradient descent has its drawbacks: two notable issues arise in practical applications. Like any vector, the negative gradient has both a direction and a magnitude. Depending on the function being minimized, either or both of these can pose challenges when the negative gradient is used as a descent direction.
Vanishing behavior of gradient magnitude#
Below we show an example run of gradient descent on a function whose minimum is at the origin \(w = 0\). This example highlights how the gradient magnitude influences step length in gradient descent. Steps are initially large far from a stationary point but become small, resembling crawling, near the function minimum. The step length parameter is deliberately set to \(\alpha = 10^{-1}\) for 10 steps to accentuate this behavior. The left panel shows the original function, and the right panel displays the steps from start (green) to finish (red). The natural crawling near the minimum, caused by the vanishing gradient magnitude, hampers quick progress, as reflected in the cost function history plot.
Fig. 3 Behavior of gradient descent near the minimum of a function#
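The crawling can also be observed numerically. The function below, \(f(w) = w^4\) with gradient \(4w^3\), is an illustrative stand-in (the exact function from the figure is not reproduced here):

```python
# Near the minimum of f(w) = w^4 the gradient 4w^3 vanishes quickly,
# so the gradient steps shrink even though alpha stays fixed.
w = 1.0
alpha = 0.1
step_sizes = []
for _ in range(10):
    step = -alpha * 4 * w ** 3   # gradient step with fixed step length parameter
    w += step
    step_sizes.append(abs(step))
```

The recorded step sizes shrink monotonically as \(w\) approaches the minimum at 0, even though \(\alpha\) never changes.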
The other case of improper performance is the slow-crawling behavior of gradient descent; more details can be found in the (provided source).
Adding Momentum#
Purpose of Momentum
When employing gradient descent, several challenges arise:
Becoming stuck at a local minimum, a consequence of the algorithm’s greediness.
Overshooting and overlooking the global optimum due to excessively rapid movement along the gradient direction.
Oscillation, a phenomenon occurring when the function’s value remains relatively constant, resembling navigation on a plateau where the height remains the same regardless of the direction.
To address these issues, a momentum term denoted as \(\alpha\) is introduced into the expression for \(\Delta \textbf{w}\) to stabilise the learning rate while approaching the global optimum value.
In the following, the superscript i indicates the iteration number:
From that we can derive the formula for momentum term:
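Since the original formulas are not reproduced here, the following is a standard reconstruction of the momentum update, consistent with the code below (\(\eta\) is the learning rate, \(\alpha\) the momentum term):

```latex
\Delta \boldsymbol w^{(i)} = -\eta\, \nabla\mathcal L\bigl(\boldsymbol w^{(i)}\bigr) + \alpha\, \Delta \boldsymbol w^{(i-1)},
\qquad
\boldsymbol w^{(i+1)} = \boldsymbol w^{(i)} + \Delta \boldsymbol w^{(i)}
```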
Gradient Descent for a Univariate Function#
Let’s define the objective function and its derivative, and set up the gradient descent function.
Note
The derivative \(f'(x)\) describes how the function changes at a specific point; it gives the direction and rate of descent toward the function’s minimum.
def f(x):
    return x ** 4 - 4 * x ** 2 + 5 * x


def derivative_f(x):
    return 4 * x ** 3 - 8 * x + 5
def gradient_descent(iterations_limit, threshold, start, obj_func, derivative_f, learning_rate=0.05, momentum=0.5):
    point = start
    points = [start]
    values = [obj_func(start)]
    delta = 0
    i = 0
    diff = 1.0e10
    while i < iterations_limit and diff > threshold:
        delta = -learning_rate * derivative_f(point) + momentum * delta
        point += delta
        points.append(point)
        values.append(obj_func(point))
        diff = abs(values[-1] - values[-2])
        i += 1
    return points, values
start_point = 2
learning_rate = 0.05
momentum = 0.3
points, values = gradient_descent(100, 0.1, start_point, f, derivative_f, learning_rate, momentum)
The graphs below visualize the gradient descent optimization process for the given objective function using the Plotly library. They allow users to interactively explore how the optimization trajectory changes for different learning rates and momentum values.
Gradient Descent for a Bivariate Function#
The same procedure applies as in the univariate case considered earlier, but now let’s consider a function of two variables, since real-world scenarios usually involve relationships between multiple factors. The computations therefore involve partial derivatives and the search for a local minimum.
Note
Partial derivatives tell us how much a function changes with respect to just one variable, while keeping all other variables constant. In our case, we have the gradient, which is the vector of partial derivatives of \(\mathcal{L}(\boldsymbol w)\) with respect to each component of \(\boldsymbol w\):
\[ \nabla\mathcal L(\boldsymbol w) = \Big(\frac{\partial \mathcal L}{\partial w_1}, \ldots, \frac{\partial \mathcal L}{\partial w_n}\Big) \]
def f(x, y):
    return x ** 3 + x ** 2 - y ** 3 - 4 * x + 22 * y - 5


def derivative_f_x(x):
    return 3 * x ** 2 + 2 * x - 4


def derivative_f_y(y):
    return -3 * y ** 2 + 22
def gradient_descent(iterations_limit, threshold, start, obj_func, derivative_f_x, derivative_f_y, learning_rate=0.05,
                     momentum=0.5):
    point = start
    points = [start]
    values = [obj_func(*start)]
    x = point[0]
    y = point[1]
    delta_x = 0
    delta_y = 0
    i = 0
    diff = 1.0e10
    while i < iterations_limit and diff > threshold:
        delta_x = -learning_rate * derivative_f_x(x) + momentum * delta_x
        delta_y = -learning_rate * derivative_f_y(y) + momentum * delta_y  # momentum applies to delta_y, not delta_x
        x += delta_x
        y += delta_y
        points.append([x, y])
        values.append(obj_func(*[x, y]))
        diff = abs(values[-1] - values[-2])
        i += 1
    return points, values
start_point = [4.5, 2]
learning_rate = 0.05
momentum = 0.5
points, values = gradient_descent(10, 0.01, start_point, f, derivative_f_x, derivative_f_y, learning_rate, momentum)
3D plots can visualize a trajectory of the optimization process on a surface for a bivariate function. This can help to see how the algorithm moves toward the minimum.
An alternative implementation of the gradient_descent_3d_visualization_plot function provides a more detailed visualization by annotating the gradient descent path with connecting lines between the points.
Below, a 2D contour plot of the previously implemented 3D visualization is drawn. In the 2D space the contour lines represent the objective function, and the annotated points show the steps taken by the algorithm during optimization.
Note
A 2D representation of 3D space makes it easier to interpret the function’s features and the trajectory of the optimization algorithm.
Gradient Descent for a Linear Regression Model#
To implement the gradient descent for linear regression model we follow the steps below:
Import testing dataset
Manually set intercept and slope values for the linear regression model
Implement the mean squared error (MSE) function (see the regression metrics section) and compute the error value for the model.
from sklearn.model_selection import train_test_split
import pandas as pd
dataset_path = 'datasets/experience_salary.csv'
data = pd.read_csv(dataset_path)
intercept = 1
slope = 2
def linear_regression(x, intercept, slope):
    return intercept + slope * x


def mean_squared_error(x, intercept, slope, y_actual):
    n = len(x)
    total_error = 0
    for i in range(n):
        total_error += (intercept + slope * x[i] - y_actual[i]) ** 2
    return total_error / n
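The same computation can also be written in vectorized form with NumPy (a sketch, not part of the original implementation; the function name mean_squared_error_vec is an assumption):

```python
import numpy as np

def mean_squared_error_vec(x, intercept, slope, y_actual):
    # Residuals for all points at once instead of a Python loop
    residuals = intercept + slope * np.asarray(x) - np.asarray(y_actual)
    return np.mean(residuals ** 2)
```

For exact predictions the error is zero, and a constant offset of 1 in every prediction gives an MSE of exactly 1.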
experience = data[['experience']]
salary = data['salary']
X_train, X_test, Y_train, Y_test = train_test_split(experience, salary, test_size=0.2, random_state=42)
X = X_test.values.flatten()
Y = Y_test.values.flatten()
linear_regression_values = [linear_regression(x, intercept, slope) for x in X]
mse = mean_squared_error(X_test.values.flatten(), intercept, slope, Y)
print(f'MSE: {mse}')
MSE: 876.7019407082549
Below is a visualization of the best-fit linear relationship according to our model. The helper functions plot_points and plot_line are used to create traces for the testing data points and the regression line; a go.Figure object with the specified data and layout is then returned.
Next we define a function that computes the partial derivatives with respect to the intercept and slope:
We iterate over each data point in the input arrays x, y_actual
We compute the partial derivatives of the mean squared error with respect to the intercept (derivative_intercept) and slope (derivative_slope) using the formula for the derivative of the mean squared error:
For the intercept: \(\frac{\partial}{\partial \text{intercept}} \text{MSE} = \frac{2}{N} \sum_{i=1}^{N} (\text{intercept} + \text{slope} \cdot x[i] - \text{y_actual}[i])\)
For the slope: \(\frac{\partial}{\partial \text{slope}} \text{MSE} = \frac{2}{N} \sum_{i=1}^{N} (\text{intercept} + \text{slope} \cdot x[i] - \text{y_actual}[i]) \cdot x[i]\)
We return the average derivatives by dividing the accumulated sums by the number of data points n
def derivative_mean_squared_error(x, y_actual, intercept, slope):
    n = len(x)
    derivative_intercept = 0
    derivative_slope = 0
    for i in range(n):
        derivative_intercept += 2 * (intercept + slope * x[i] - y_actual[i])
        derivative_slope += 2 * (intercept + slope * x[i] - y_actual[i]) * x[i]
    return derivative_intercept / n, derivative_slope / n
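One way to sanity-check such derivative code is to compare it against finite differences. The self-contained sketch below repeats the two functions so it can run on its own; the small dataset is illustrative:

```python
def mean_squared_error(x, intercept, slope, y_actual):
    return sum((intercept + slope * xi - yi) ** 2 for xi, yi in zip(x, y_actual)) / len(x)

def derivative_mean_squared_error(x, y_actual, intercept, slope):
    n = len(x)
    d_intercept = sum(2 * (intercept + slope * xi - yi) for xi, yi in zip(x, y_actual)) / n
    d_slope = sum(2 * (intercept + slope * xi - yi) * xi for xi, yi in zip(x, y_actual)) / n
    return d_intercept, d_slope

x = [1.0, 2.0, 3.0]
y = [2.0, 2.5, 4.0]
b, w, h = 0.5, 1.0, 1e-6

# Analytic derivatives vs. central finite differences of the MSE
d_b, d_w = derivative_mean_squared_error(x, y, b, w)
fd_b = (mean_squared_error(x, b + h, w, y) - mean_squared_error(x, b - h, w, y)) / (2 * h)
fd_w = (mean_squared_error(x, b, w + h, y) - mean_squared_error(x, b, w - h, y)) / (2 * h)
```

Because the MSE is quadratic in both parameters, the central differences agree with the analytic derivatives up to floating-point error.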
The function below returns updated values of the intercept and slope parameters. The update subtracts the product of the learning rate and the respective partial derivative from the current value of each parameter.
def gradient_descent(x, y, intercept, slope, learning_rate, num_iterations):
    for i in range(num_iterations):
        partial_derivative_intercept, partial_derivative_slope = derivative_mean_squared_error(x, y, intercept, slope)
        intercept -= learning_rate * partial_derivative_intercept
        slope -= learning_rate * partial_derivative_slope
    return intercept, slope
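As a sanity check, the same update rule recovers known parameters on synthetic data where the true line is \(y = 3 + 2x\) (the data and hyperparameters here are illustrative, not the dataset used in the text; the functions are repeated so the sketch runs on its own):

```python
def derivative_mean_squared_error(x, y_actual, intercept, slope):
    n = len(x)
    d_intercept = sum(2 * (intercept + slope * xi - yi) for xi, yi in zip(x, y_actual)) / n
    d_slope = sum(2 * (intercept + slope * xi - yi) * xi for xi, yi in zip(x, y_actual)) / n
    return d_intercept, d_slope

def gradient_descent(x, y, intercept, slope, learning_rate, num_iterations):
    for _ in range(num_iterations):
        d_intercept, d_slope = derivative_mean_squared_error(x, y, intercept, slope)
        intercept -= learning_rate * d_intercept
        slope -= learning_rate * d_slope
    return intercept, slope

# Noise-free synthetic data: y = 3 + 2x for x = 0, 1, ..., 9
xs = [float(i) for i in range(10)]
ys = [3.0 + 2.0 * xi for xi in xs]
intercept, slope = gradient_descent(xs, ys, 0.0, 0.0, 0.01, 5000)
```

With enough iterations the estimates converge close to the true intercept 3 and slope 2.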
To call the gradient_descent function, we initially set arbitrary values for the hyperparameters learning_rate and the number of iterations.
In each iteration, the algorithm calculates the gradient of the cost function with respect to each parameter. The gradient points in the direction of the steepest increase in the cost.
The parameters are then updated in the opposite direction of the gradient to move towards the minimum of the mean square error function. The update is proportional to the learning_rate, a hyperparameter that determines the size of the steps taken in each iteration.
learning_rate = 0.001
iterations = 1000
gd_intercept, gd_slope = gradient_descent(X, Y, intercept, slope, learning_rate, iterations)
print(gd_intercept, gd_slope)
1.9080791607526073 0.9313247403333909
The code below prints the mean squared error and the predicted values for the testing data using the linear_regression model with gradient descent-optimized parameters.
gd_mse = mean_squared_error(X, gd_intercept, gd_slope, Y)
print(f'MSE with gradient descent: {gd_mse}')
gd_linear_regression_values = [linear_regression(x, gd_intercept, gd_slope) for x in X]
MSE with gradient descent: 29.349835951481545
Compare the mean squared error (MSE) result against the built-in linear regression model from the sklearn library.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
model = LinearRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
mse = mean_squared_error(Y_test, Y_pred)
print(f'Sklearn mse {mse}')
Sklearn mse 27.65026873284228
Let’s define a function that draws the best-fit line after the optimization has been applied.